Generalizing a Strongly Lexicalized Parser using Unlabeled Data
نویسندگان
چکیده
Statistical parsers trained on labeled data suffer from sparsity, both grammatical and lexical. For parsers based on strongly lexicalized grammar formalisms (such as CCG, which has complex lexical categories but simple combinatory rules), the problem of sparsity can be isolated to the lexicon. In this paper, we show that semi-supervised Viterbi-EM can be used to extend the lexicon of a generative CCG parser. By learning complex lexical entries for low-frequency and unseen words from unlabeled data, we obtain improvements over our supervised model for both indomain (WSJ) and out-of-domain (questions and Wikipedia) data. Our learnt lexicons when used with a discriminative parser such as C&C also significantly improve its performance on unseen words.
منابع مشابه
Applying Co-Training Methods to Statistical Parsing
We propose a novel Co-Training method for statistical parsing. The algorithm takes as input a small corpus (9695 sentences) annotated with parse trees, a dictionary of possible lexicalized structures for each word in the training set and a large pool of unlabeled text. The algorithm iteratively labels the entire data set with parse trees. Using empirical results based on parsing the Wall Street...
متن کاملFeature Engineering in Persian Dependency Parser
Dependency parser is one of the most important fundamental tools in the natural language processing, which extracts structure of sentences and determines the relations between words based on the dependency grammar. The dependency parser is proper for free order languages, such as Persian. In this paper, data-driven dependency parser has been developed with the help of phrase-structure parser fo...
متن کاملDisambiguation of Morphological Structure using a PCFG
German has a productive morphology and allows the creation of complex words which are often highly ambiguous. This paper reports on the development of a head-lexicalized PCFG for the disambiguation of German morphological analyses. The grammar is trained on unlabeled data using the Inside-Outside algorithm. The parser achieves a precision of more than 68% on difficult test data, which is 23% mo...
متن کاملLearning Head-modifier Pairs to Improve Lexicalized Dependency Parsing on a Chinese Treebank
Due to the data sparseness problem, the lexical information from a treebank for a lexicalized parser could be insufficient. This paper proposes an approach to learn head-modifier pairs from a raw corpus, and to integrate them into a lexicalized dependency parser to parse a Chinese Treebank. Experimental results show that this approach not only enlarged the coverage of bi-lexical dependency, but...
متن کاملMultiword Units in Syntactic Parsing
The influence of MWU recognition on parsing accuracy is investigated with respect to a deterministic dependency parser based on memory-based learning. The effect of a perfect MWU recognizer is simulated through manual annotation of MWUs in corpus data used for training and testing two different versions of the parser, one which is lexicalized and one which is not. The results show a significant...
متن کامل